Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove duplicate search entries #981

Merged
merged 3 commits into from
Sep 18, 2024

Conversation

BPerlakiH
Copy link
Collaborator

@BPerlakiH BPerlakiH commented Sep 15, 2024

Fixes: #980
Superseeds #979

What we really want is unique search results, which is in fact unique URLs that the user can click on and navigate further. If 2 search results are linking to the same URL, that's a duplicate.

Solution

Store a set of foundURLs as we are processing the search results (both indexed and title search), and look for duplicates. Also make sure we adjust the URL endings to always contain the trailing "/" slash, but this is only for our comparison, the returned search result URLs should not be modified.

After the fix on macOS:

Screenshot 2024-09-15 at 13 47 36

@codecov-commenter
Copy link

codecov-commenter commented Sep 15, 2024

Codecov Report

Attention: Patch coverage is 2.08333% with 47 lines in your changes missing coverage. Please review.

Project coverage is 39.19%. Comparing base (9093134) to head (2065328).

Files with missing lines Patch % Lines
SwiftUI/Model/SearchOperation/SearchOperation.mm 2.08% 47 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #981      +/-   ##
==========================================
+ Coverage   37.23%   39.19%   +1.96%     
==========================================
  Files         111      111              
  Lines        6346     6360      +14     
==========================================
+ Hits         2363     2493     +130     
+ Misses       3983     3867     -116     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@HiroyasuNishiyama
Copy link
Contributor

@BPerlakiH Thank you for your fix.

I think one issue is how to handle the score of search results when there are duplicate entries.
In the original code, title search results are ranked higher than index search results and are displayed higher in the result list.
Since the title search result score seems more natural, I tried to reflect that rather than simply dropping it in my fix.

@BPerlakiH
Copy link
Collaborator Author

BPerlakiH commented Sep 15, 2024

Thank you @HiroyasuNishiyama, that is a good point.
I did addressed 2 concerns in my latest changes, one described in the issue itself, about failing archives, and loosing results on those.

I also changed the order of execution, so we do the title search first, and then the indexed search, this way if we drop duplicates we do that from the indexed ones, leaving the probability: null, which I think would be equivalent to your solution.

Please note that we do another sorting on these results here:

// calculate score for all results
for result in results {
guard !isCancelled else { return }
let distance = WagnerFischer.distance(result.title.lowercased()[...], searchText[...])
if let probability = result.probability?.doubleValue {
result.score = NSNumber(floatLiteral: Double(distance) * Foundation.log(7.5576 - 6.4524 * probability))
} else {
result.score = NSNumber(integerLiteral: distance)
}
}

@HiroyasuNishiyama
Copy link
Contributor

HiroyasuNishiyama commented Sep 16, 2024

@BPerlakiH
I think simply changing the order to do title search first introduces another issue.
The results of title search do not include snippets, so if there is a duplicate entry, the snippet will be dropped from the result.

And I realized that my fix had a misunderstanding.
To match the current search scores, I think the probability value should be set to the larger of 0.75 (i.e. the solution of log(7.5576 - 6.4524 * x)=1) or the title search result probability rather than setting it to nil.
Though, I know it's ugly to introduce such a constant into this code.

FYI, no exceptions occur during index searches in my test environment.

@rgaudin
Copy link
Member

rgaudin commented Sep 16, 2024

I think it would be good to have @kelson42's opinion here, in regard to how this should compare with other readers.

@kelson42
Copy link
Contributor

At this stage, I would recommend to stop investing more time on this PR. Like I suspected, the solution is not straight.

We should think out of the box about this. IMO whatever we decide do, this is not the role of the reader to do this kind of "smart" computations.

@HiroyasuNishiyama
Copy link
Contributor

As a user, I like the current Apple Kiwix search interface, where I do not have to worry about the indexing method.
And I would prefer that redundant entries be excluded from the results from the search results.
Of course, it is up to your team to decide what design choices to make.

For reference, I would like to leave my current thoughts on this PR.
My first PR and this PR removes redundant search results in the Objective C++ module (SearchOperation.mm).
Since this causes issues in handling the deletion results as mentioned above, I think it would be better to delete identical entries so that the first entry remains after the sorting process in swift (SearchOperation.swift) shown by @BPerlakiH .
As an addition, we need to add a process to leave the snippet of the removed entry.

@BPerlakiH
Copy link
Collaborator Author

Ok, to sum up what's in this PR already:

  • removing duplicate entries from search results (I reversed the title/index now, so snippets are there)
  • try / catching errors on searches by each ZIM file (and search type), this way we avoid no results (or less results) if one of the ZIM files is invalid.

For future discussion:

  • we could improve to use only one way of searching, if there are absolutely no results from that, maybe only in that case fall back to another type of search
  • possible improvement on the OpenZIM side, is to have snippets for title searches as well, eg. returning the first sentence / paragraph of the article (currently we get back the title as the snippet, which we can not / do not use).

IMHO: this PR is already an improvement from what we had. I also do agree with @kelson42 that this is still not an ideal place to be, and a broader / more universal approach is needed for search handling, that would unify (cross platform) and simplify the reader side.

@kelson42
Copy link
Contributor

@BPerlakiH @rgaudin @HiroyasuNishiyama I will merge this PR because it seems to fix the bug reported originaly. For the long standing problem/improvement, I have created a dedicated issue.

@kelson42 kelson42 force-pushed the 980-remove-duplicate-entry-from-search-results branch from 2065328 to 990e51f Compare September 18, 2024 09:17
@kelson42 kelson42 removed this from the 3.6.0 milestone Sep 18, 2024
@kelson42 kelson42 merged commit c9043bf into main Sep 18, 2024
4 checks passed
@kelson42 kelson42 deleted the 980-remove-duplicate-entry-from-search-results branch September 18, 2024 09:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Duplicate entries in search results
5 participants